RFC#0192 - Ensure workers do not get unnecessarily killed #192

JohanLorenzo · 2024-06-20T10:31:43Z

No description provided.

rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md

JohanLorenzo · 2024-07-18T13:08:42Z

@lotas 👋 Do you see anything more we should discuss? If not, is okay to open up the discussion in the next community meeting?

lotas · 2024-07-22T12:17:05Z

@lotas 👋 Do you see anything more we should discuss? If not, is okay to open up the discussion in the next community meeting?

please bring it up during the next community meeting, would be best! Thanks

JohanLorenzo · 2025-09-26T13:11:33Z

I took another stab at it now that #191 is implemented 😃

rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md

matt-boris

To me, this problem is solved with minimal code changes to worker manager to allow for a secondary worker config to be used when provisioning workers to fill the minCapacity void. This allows us to just use what's already there.

This could look like allowing a second block of worker config in the worker pool config or even just have worker manager automatically edit the idleTimeoutSecs to 0 and make it an on-demand instance type for the workers they're provisioning to get the capacity to minCapacity.

rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md

JohanLorenzo · 2025-09-26T17:04:44Z

Thank you very much @lotas and @matt-boris for your insightful input! I like where the RFC is going. It's now much simpler 👍 I believe I addressed all the comments and it's now ready for another round of review. Please let me know if there's anything I missed.

rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md

lotas · 2025-10-06T13:01:55Z

rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md

+#### 3. Pool Configuration
+
+**New Configuration Options:**
+Pools will have a new boolean flag `enableMinCapacityWorkers` that enables minCapacity worker behavior. When enabled, workers fulfilling minCapacity requirements will run indefinitely (idle timeout set to 0), providing maximum cache preservation.


I wonder if we really want an extra flag for this. For me minCapacity is already there.
Unless someone wanted to use it the way they were currently working, I think we can just not introduce a new flag.
just thinking out loud

I was thinking of a new flag in order to roll out the changes. But maybe there's another way I didn't think of. I'm open to other options 🙂

rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md

lotas

I think this proposal is in really good shape from my point of view 👍 thank you

JohanLorenzo · 2025-10-15T08:23:00Z

@jcristau, I remember you had some thoughts on the current behavior. What do you think of this proposal?

petemoore

I understand the motivation for splitting a worker pool into two groups - the workers that stay alive forever, and those that don't, but I see some downsides to this approach. I think the reason this solution is proposed is because currently workers decide themselves when to terminate, rather than being signaled to do by a central authority (e.g. Worker Manager). However, at this point, Worker Manager is now effectively managing termination of some instances and not others, and under this approach, the responsibility is somewhat artificially split, with neither party (Worker Manager and the the worker itself) having control of the entire pool. This leads to some unpleasant consequences, such as worker manager needing to hack the worker config generation (the worker config is currently embedded in the launch config, and will not map to what the worker config actually looks like on the worker, since the idle timeout parameter only applies to a subset of the workers). Note, different worker implementations may also have different config settings for the idle timeout, so Worker Manager needing to understand the config of different worker implementations is also very brittle. It also doesn't allow Worker Manager to be smart about which workers it wants to keep alive, and those that it wants to terminate, since it can only terminate the workers in the "stay alive forever" group, which might be suboptimal (perhaps it is preferential to retain a different subset of workers that are cheaper/more performant). For example, prices might drop for one instance type, but you have workers set to run forever with a different instance type, so you are stuck with the ones you have.

To solve all of this, since Worker Manager is now becoming responsible for managing the termination of some workers, I think it would be better to move the entire responsibility to Worker Manager, and avoid the hacking of the Generic Worker config. I would propose that rather than quarantining the worker (which is intended for a different use case) and then later shutting it down, that a signal of some sort would be sent that signals that the worker should gracefully terminate. This way we do not muddy the Quarantine logs with events that are for terminations, not for quarantines (e.g. auditing real quarantines becomes problematic when the audit logs are full of business-as-usual terminations).

Consider also that technically a worker can perform work for a worker pool even if it wasn't launched by Worker Manager. Worker Manager may know it exists (since the Queue will report the worker as being active) and it may contribute towards the pool size, but maybe Worker Manager does not have a means to terminate it. My point here is that it gets very messy very quickly trying to account for all scenarios in this proposed design. I think it would be much simpler and cleaner instead for Worker Manager to make decisions about all the workers it launches, and move the responsibility to Worker Manager in a worker-implementation-agnostic way (Worker Manager should not need to know what the worker implementation config settings are or how to patch them) and at a worker pool level we should be able to tell Worker Manager "keep this many workers alive" and it should be able to signal to workers it has created that it no longer needs them, whenever it wants, and use this simple mechanism to manage its pool, not just a subset of the workers it has created.

So in summary, I think we should:

When generating worker pool definitions for Generic Worker workers in community-tc-config, fxci-config etc, set idle time to 0 (the config option can continue to be supported by Generic Worker, but just not used in our Worker Manager managed worker pools, as it is potentially still useful for when the worker isn't spawned by Worker Manager).
introduce a signalling mechanism that Worker Manager can use to advise the worker to stop taking new jobs, and to shutdown (similar to quarantine but more explicit that it should be followed by a shutdown). Avoid hijacking the existing call which is really intended for auditing (rare) security incidents, troubleshooting unexpected behaviour etc.
Worker Manager implements some criteria to decide which workers to terminate and which to keep alive when pool size is greater than desired. Initially this could be just keeping the workers around that have been alive the longest, but there could be some scoring mechanism which favours long lived workers but also measures performance, tracks anomolies, or unexpected expired task claims, etc. Maybe there is a cap on maximum lifetime of a worker for security reasons (e.g. don't allow any worker to live longer than 2 weeks) which may also ensure images are not too old (e.g. if the machine image it is built from becomes out of date, it should not be kept indefinitely). There are several reasons why your initial choice about which workers to keep alive indefinitely might change, and having a homogeneous pool where "all workers are equal" in the ability to keep workers alive, gives Worker Manager more power to make smarter decisions, avoid muddying configs, and keeping split of responsibilities clear.

lotas · 2025-11-03T12:18:13Z

If I understand correctly @petemoore, you suggest:

Add a feature to g-w (worker-runner) to ask worker-manager if it should keep on doing work or should terminate itself (as soon as appropriate)
let worker-manager scale down automatic pools.

Aren't they competing solutions?

I like the idea of making worker manager not aware of worker configs, probably less error-prone in the long run 👍

I think there could be edge cases on both sides of those solutions, for example, if we make all workers long-lived ones, and worker-manager fails in some way to provision/scan (like expired credentials or internal error), which would result in many workers running way longer than they should.

Johan's initial proposal is probably a good temporary fix to let some workers live longer, but implementing some sort of signaling mechanism where workers talk to worker manager to know if they should keep going instead of shutting down might be better in the long run.

petemoore · 2025-11-03T13:35:48Z

I wrote a lot of words, but in summary, I think I’m suggesting the approach is ok in principle, but should be for the whole pool, instead of splitting it in two.

This way, Worker Manager is free to decide which instances to terminate, based on its own conditions, which may change over time.

With Worker Manager managing the termination of workers, we can either turn off the feature on the workers directly in the Generic-Worker config, or have some longer fallback period, to catch situations like the one you highlight.

Regarding how Worker Manager communicates with the worker to ask it to gracefully shutdown, I am reasonably agnostic. Perhaps the quarantine followed by a termination isn't the worst thing, although it doesn't really allow workers to perform any cleanup (e.g. imagine in future that workers might want to e.g. preserve their caches by writing them to a storage bucket before terminating, or securely erasing secrets, or writing an activity report somewhere, reporting lifecycle events to some audit trail, or any other teardown type activity that might be useful...).

In any case, I think any time Worker Manager decides it no longer requires an instance, there should be some way to signal it to a worker, so that it can gracefully terminate, but Worker Manager should keep track of the instance and forcefully kill it after some timeout, if the worker hasn't successfully terminated itself.

lotas · 2025-11-03T13:40:07Z

Worker manager already can removeWorker() which will tell instance to shutdown, and worker-runner/generic-worker would get notified when this happens. But I guess it comes without any warning to the instance that is about to shutdown - something like spot termination is doing where you can potentially upload stuff before shutting down.

If there was some similar way to send SIGTERM signal to a vm and let g-w wrap things up before getting hard kill event that would make it easier probably

JohanLorenzo · 2025-11-05T10:49:10Z

Thank you for your input, @petemoore! I agree having a central place like worker-manager makes more sense and avoid duplication of responsibilities 👍 I just added a phase 2 to cover that. For 2025, I'd like us to focus on phase 1, which is leveraging idleTimeoutSecs=0

For example, prices might drop for one instance type, but you have workers set to run forever with a different instance type, so you are stuck with the ones you have.

For the record, this use case doesn't exist in fxci. yq '.pools[] | select((.config.instance_types | length) > 1) | .pool_id' worker-pools.yml doesn't return anything.

In community, there are just 2 pools: git-cinnabar/windows and relman/win2022. The former is not supported anymore now that Firefox moved to https://github.com/mozilla-firefox/firefox. The latter is then the only exception. I believe it'd be better to split this pool per instance type rather than trying to find a design because of it. What do you think? yq command for community.

yq '
    (filename | sub(".*/", "") | sub(".yml$", "")) as $file |
    keys[] as $proj |
    .[$proj].workerPools // {} |
    to_entries[] |
    select((.value.instanceTypes // {} | length) > 1) |
    $file + ": " + $proj + "/" + .key
  ' config/projects/*.yml

This way we do not muddy the Quarantine logs with events that are for terminations, not for quarantines

TIL we have specific quarantine logs. Where I can find them?

Consider also that technically a worker can perform work for a worker pool even if it wasn't launched by Worker Manager.

Do we have a way to measure how often this happens? I'm under the impression that it rarely occurs on Firefox CI because spawning a new instance is too complex and most people don't have enough permissions to spawn their own worker. I might be wrong though. I'd love to see if what it's really like.

petemoore · 2025-11-05T13:15:15Z

Thank you for your input, @petemoore! I agree having a central place like worker-manager makes more sense and avoid duplication of responsibilities 👍 I just added a phase 2 to cover that. For 2025, I'd like us to focus on phase 1, which is leveraging idleTimeoutSecs=0

For example, prices might drop for one instance type, but you have workers set to run forever with a different instance type, so you are stuck with the ones you have.

For the record, this use case doesn't exist in fxci. yq '.pools[] | select((.config.instance_types | length) > 1) | .pool_id' worker-pools.yml doesn't return anything.

In community, there are just 2 pools: git-cinnabar/windows and relman/win2022. The former is not supported anymore now that Firefox moved to https://github.com/mozilla-firefox/firefox. The latter is then the only exception. I believe it'd be better to split this pool per instance type rather than trying to find a design because of it. What do you think? yq command for community.
yq '
    (filename | sub(".*/", "") | sub(".yml$", "")) as $file |
    keys[] as $proj |
    .[$proj].workerPools // {} |
    to_entries[] |
    select((.value.instanceTypes // {} | length) > 1) |
    $file + ": " + $proj + "/" + .key
  ' config/projects/*.yml

This was just an example of one scenario where Worker Manager at runtime may have reason to prefer keeping one worker alive over another, from information not available at the time it launched the "run forever" instance. The point being, if the a matter is decided in worker config, it is locked in and non-changeable as soon as the instance is spawned.

This way we do not muddy the Quarantine logs with events that are for terminations, not for quarantines

TIL we have specific quarantine logs. Where I can find them?

I'm not sure if we have quarantine logs per se, but the Queue provides an API to query quarantine history.

Consider also that technically a worker can perform work for a worker pool even if it wasn't launched by Worker Manager.

Do we have a way to measure how often this happens? I'm under the impression that it rarely occurs on Firefox CI because spawning a new instance is too complex and most people don't have enough permissions to spawn their own worker. I might be wrong though. I'd love to see if what it's really like.

Yes I doubt it often happens, it was more an observation about the limitations of the proposed design. There is no technical restriction from being able to supplement a Worker Manager worker pool with additional e.g. hardware workers. Now imagine the minimum pool size is set to 5 and Worker Manager starts up, and creates 5 live-forever workers. At some point, someone decides it is better for the 5 workers that live forever to be hardware workers, for whatever reason. They create 5 hardware workers that live forever. Worker Manager now sees there are 10 workers running, but no tasks, and the minimum worker pool size is 5. So ideally it would terminate the 5 it started up. Of course, an operations staff member can also delete them manually, so no big deal. It is just one extra thing for a person to mentally track.

However, given that the design as proposed is relatively straightforward to implement, maybe we can go with that for now. Worker Manager should not need to know about Generic Worker config settings, but it is a hack we can live with, if we can agree to remove it in the future. It really should be part of the protocol of interaction between worker and worker manager.

Maybe a slightly cleaner solution would be when the worker registers, it gets a worker manager defined parameter about whether it should stay alive forever, or what its idle timeout should be. At this point, worker runner or the worker can translate it into its own config setting.

JohanLorenzo · 2025-11-05T13:24:23Z

Agreed. Let's go with the proposed design for now and revisit the day we decide to mix hardware workers among cloud-based ones.

JohanLorenzo · 2025-11-07T17:59:24Z

Discussed with the whole Taskcluster team at today's community call. We agreed the RFC is in good shape for entering the Final Comment phase.

jcristau · 2025-11-07T14:01:52Z

rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md

+   - Package manager caches (npm, pip, cargo, etc.)
+   - Container images and layers
+
+2. **Provisioning Delays**: New worker provisioning [takes ~75 seconds average for decision pools](https://taskcluster.github.io/mozilla-history/worker-metrics), during which tasks must wait


IMO the average is not the issue here, it's that whenever something goes slightly wrong, or it's a slightly less often used pool, it can take tens of minutes.

jcristau · 2025-11-13T10:36:29Z

rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md

+When a launch configuration is changed, removed, or archived, all workers created from the old configuration must be terminated and replaced:
+- If a launch configuration is archived (not present in new configuration), identify all long-running workers created from it
+- Terminate these workers via cloud provider APIs after checking for running tasks


How does that work? Do we have a way to signal to the worker that it shouldn't claim any more tasks? Or signal to the queue that it shouldn't give it any? Or do we risk killing a running task?

Worker-manager is able to terminate cloud instances (e.g.)

In its current state, the RFC risks killing a running task. In a previous revision, we thought of quarantining the worker. However @petemoore raised the concern that it would bend the purpose of the quarantine mechanism.

So I think what we are missing here is the guarantee that worker is not running any tasks and the queue will not give out new tasks.
Maybe we need some extra mechanism here in addition to quarantine mechanism, that would signal queue to mark worker as "scheduled for removal" which would essentially do the same but have clear signal that we are about to decommission this worker.

Could be a short-lived queue (~20min same as claim timeout)

jcristau · 2025-11-13T10:41:18Z

rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md

+
+Worker-manager enforces idle timeout using existing [`queueInactivityTimeout` from lifecycle configuration](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/schemas/v1/worker-lifecycle.yml#L33-L50) and through idleTimeout (as specified in phase 1).
+
+[Scanner polls](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/worker-scanner.js#L69-L76) Queue API to track worker idle time:


Do we expect this to scale well? (We already have issues with scanner loops taking too long.)

@lotas, what do you think?

yeah, we probably don't want to kill queue api with those increased loads of queries.

Instead, I think we should build some state inside w-m service.
One idea could be to listen to queue task related events and update some form of cache or state with the following details:

workerId

lastTaskId

lastTaskStartedAt

lastTaskResolvedAt

this way we'll simply leverage existing rabbitmq exchanges, and would know the approximate state of the worker without ddosing queue service

jcristau · 2025-11-13T10:42:06Z

rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md

+**Migration:**
+Deploy as breaking change requiring simultaneous worker-manager and generic-worker updates.


This seems like a fairly risky strategy, best avoided...

@petemoore, could we have a way for generic-worker to support both idle-termination and termination from worker-manager for a transition period?

jcristau · 2025-11-13T10:44:21Z

rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md

+- Launch configuration changed/archived
+- Worker is unhealthy (provider-specific check)
+
+All terminations check for running tasks before proceeding.


As in phase 1, care will be needed here to avoid races between termination and claimWork, it would be good to flesh out that process more.

@jcristau What would you recommend?

lotas · 2025-11-13T13:41:34Z

rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md

+
+**Removing Excess Capacity:**
+
+When `runningCapacity > minCapacity`, [worker-manager scanner identifies](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/worker-scanner.js#L69-L76) and terminates excess workers.


As it currently stands, worker-scanner doesn't remove workers when capacity changes, it only updates db state.

Are we talking here about the case when minCapacity was lowered and we need to reduce the number of minCapacity-workers provisioned earlier?

lotas · 2025-11-13T13:45:48Z

rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md

+When `runningCapacity > minCapacity`, [worker-manager scanner identifies](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/worker-scanner.js#L69-L76) and terminates excess workers.
+
+**Termination logic:**
+- Query [Queue API client](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/main.js#L176-L178) to check if worker's latest task is running


this will likely not guarantee anything (workers currently polling for work), but would increase load on the queue service api.

Assuming we want to implement this active workers termination routines already, I would propose to keep it really simple, and only select the desired amount of oldest workers to be terminated

lotas · 2025-11-13T13:58:44Z

rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md

+
+**Launch Configuration Changes:**
+When a launch configuration is changed, removed, or archived, all workers created from the old configuration must be terminated and replaced:
+- If a launch configuration is archived (not present in new configuration), identify all long-running workers created from it


this is actually a good point, and probably should be solved separately. Launch configs will be archived if something changes, and it's a good question if they should be killed immediately or not. Because if there was a minor change, like zone removed, do we want to let it work until it idles and shuts down, or we want to kill all such workers right away

lotas · 2025-11-13T14:25:45Z

rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md

+Worker-manager terminates workers when:
+- Idle timeout exceeded (`queueInactivityTimeout`)
+- Capacity exceeds `maxCapacity`
+- Capacity exceeds `minCapacity` (terminate oldest first)


wait, wouldn't it contradict the whole point of min capacity workers? If we'll always be killing oldest, those with caches would be gone first

JohanLorenzo · 2025-11-19T17:46:16Z

Discussed in person with @lotas, @petemoore, @hneiva. We decided to take a different and more generic approach. We decided to close this current RFC and to create a new one with the new proposal. The main reason for this is: the new approach requires a full rewrite of the RFC, so the comments in this current RFC will create more noise than anything.

JohanLorenzo requested a review from a team as a code owner June 20, 2024 10:31

JohanLorenzo requested review from lotas, matt-boris and petemoore and removed request for a team June 20, 2024 10:31

JohanLorenzo changed the title ~~RFC#0192 - ensures workers do not get unnecessarily killed~~ RFC#0192 - Ensure workers do not get unnecessarily killed Jun 20, 2024

JohanLorenzo force-pushed the rfc-192 branch from a90eb39 to 21f1df3 Compare June 20, 2024 12:12

lotas reviewed Jun 20, 2024

View reviewed changes